Looking at case studies is a good way to gain intuition about how to build conv-nets. An architecture that works well on one task often works well on other tasks as well.
Full Architecture of the network:
LeNet-5 visualised
Parameter Count by layer:
Total count: 59 830. Note how most of the parameters are used by the fully connected layers.
AlexNet visualised
Full Architecture of Network:
For convenience, I summarise each row as follows:
Conv (filter width, stride, number of filters), then max-pooling (filter width, stride), then the output volume's dimensions after pooling: (width, height, channels)
Number of parameters per layer (weights only, biases omitted; note that each convolutional filter spans all input channels, so a conv layer has \(f \times f \times n_c \times n_{\text{filters}}\) weights):

1. \(11 \times 11 \times 3 \times 96 = 34\,848\)
2. \(5 \times 5 \times 96 \times 256 = 614\,400\)
3. \(3 \times 3 \times 256 \times 384 = 884\,736\)
4. \(3 \times 3 \times 384 \times 384 = 1\,327\,104\)
5. \(3 \times 3 \times 384 \times 256 = 884\,736\)
6. 0 (pooling layers have no parameters)
7. \(9216 \times 4096 = 37\,748\,736\)
8. \(4096 \times 4096 = 16\,777\,216\)
9. \(4096 \times 1000 = 4\,096\,000\)

Total count: 62 367 776 parameters, consistent with the commonly quoted figure of roughly 60 million. As with LeNet-5, most of the parameters sit in the fully connected layers.
VGG - 16 visualised
Full Architecture:
5 residual blocks - the activations are passed directly forward, “skipping” two layers
In a residual block, \(a^{[l+2]} = g(z^{[l+2]} + a^{[l]})\) for every even \(l\). Empirically, with a plain feed-forward network, adding layers eventually has a negative effect on your model’s accuracy on the training set. With residual blocks, however, adding layers gives a consistently decreasing training error. So residual blocks allow you to train much deeper networks (hundreds of layers).
A large network, with layers \(l\) and \(l+2\) very deep in the network
It makes the Identity Function easy to learn!
Consider the case where you have a very large neural network of residual blocks, as in the image above, leading to layers \(l\) and \(l+2\). Imagine there is large weight decay, leading to tiny values of \(w^{[l+2]}\) and \(b^{[l+2]}\). Then,
\[ \begin{array} {rcl} a^{[l+2]} &=& g(z^{[l+2]} + a^{[l]}) \\ &=& g(w^{[l+2]} a^{[l+1]} + b^{[l+2]} + a^{[l]}) \\ &\approx& g(a^{[l]}) \\ &\approx& a^{[l]}. \\ \end{array} \]
(The last step uses the fact that \(a^{[l]}\) is itself a ReLU output, so \(g(a^{[l]}) = a^{[l]}\).) The result above shows that if a layer’s weights are driven towards zero, the residual block simply computes the identity. This means adding two layers can’t hurt: the larger network will be at least as good as a smaller network of the same architecture, which lets us add layers without fear of worse performance on the training set. In highly parameterised plain networks, by contrast, it is difficult to find parameters that implement the identity function.
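The collapse to the identity can be checked numerically. A minimal numpy sketch, taking the extreme case where weight decay has driven \(w^{[l+2]}\) and \(b^{[l+2]}\) exactly to zero:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

a_l = relu(rng.normal(size=128))           # activations entering the block (non-negative)
W1, b1 = rng.normal(size=(128, 128)), rng.normal(size=128)

# Heavy weight decay has driven the second layer's parameters to zero.
W2, b2 = np.zeros((128, 128)), np.zeros(128)

a_l1 = relu(W1 @ a_l + b1)                 # a^[l+1]
a_l2 = relu(W2 @ a_l1 + b2 + a_l)          # a^[l+2] = g(z^[l+2] + a^[l])

assert np.allclose(a_l2, a_l)              # the block computes the identity
```

The final equality holds because \(a^{[l]}\) is already non-negative, so passing it through ReLU again changes nothing.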
Say \(a^{[l+2]}\) has 256 nodes and \(a^{[l]}\) only has 128. Then we define \(a^{[l+2]} = g(z^{[l+2]} + w_s a^{[l]})\) where \(w_s \in \mathbb{R}^{256 \times 128}\), which fixes the dimensions. We can either make \(w_s\) a learnable parameter or hard-code it to implement zero-padding, so that the extra nodes of the shortcut are just set to zero. No guideline is given as to which is better; I like the elegance of zero-padding, though it doesn’t work if the later layer is the smaller one.
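The hard-coded zero-padding option can be written out explicitly. A sketch for the 128-to-256 case above: \(w_s\) is the identity stacked on a block of zeros, so the shortcut copies \(a^{[l]}\) into the first 128 positions and pads the rest with zeros.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)
a_l = relu(np.random.default_rng(1).normal(size=128))   # 128 nodes at layer l

# Hard-coded w_s implementing zero-padding: identity on top, zeros below.
w_s = np.vstack([np.eye(128), np.zeros((128, 128))])    # shape (256, 128)

shortcut = w_s @ a_l                                    # shape (256,)
assert np.allclose(shortcut[:128], a_l)                 # a^[l] copied through
assert np.allclose(shortcut[128:], 0.0)                 # extra nodes are zero
```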
Example ResNet for images. Note it’s a bunch of convolutional layers with an occasional pooling layer.
Consists of a series of Inception Layers. We will first describe the \(1 \times 1\) convolution, then go on to describe Inception Layers.
\(1 \times 1\) Convolutional layer
From 2013. Although confusingly named, this type of layer reduces the number of channels of a volume while keeping the height and width of each channel the same. And even if we kept the number of channels the same, a conv layer applies an activation function (typically ReLU), so we would still be adding a nonlinearity to the network.
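A \(1 \times 1\) convolution is just a matrix multiply over the channel dimension, applied independently at every pixel. A small numpy sketch (the 192-to-32 channel counts are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=(28, 28, 192))        # input volume: 28x28, 192 channels
W = rng.normal(size=(192, 32))            # 32 filters, each of shape 1x1x192

# Contract the channel axis: every pixel's 192-vector is mapped to a 32-vector.
z = np.tensordot(x, W, axes=([2], [0]))   # shape (28, 28, 32)
a = np.maximum(z, 0.0)                    # ReLU adds the nonlinearity

print(a.shape)                            # (28, 28, 32): same width/height, fewer channels
```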
Inception-layer - note the 1-by-1 convs used for dimension reduction. Also note that for the pooling branch, the 1-by-1 conv is applied after the pooling.
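The branches of an inception layer all preserve the spatial dimensions (via “same” padding), so their outputs can be concatenated along the channel axis. A sketch with illustrative channel counts, using zero arrays as stand-ins for the branch outputs:

```python
import numpy as np

# Stand-ins for the four branch outputs of one inception layer on a 28x28 volume.
branch_1x1 = np.zeros((28, 28, 64))    # plain 1x1 conv branch
branch_3x3 = np.zeros((28, 28, 128))   # 1x1 channel reduction, then 3x3 conv
branch_5x5 = np.zeros((28, 28, 32))    # 1x1 channel reduction, then 5x5 conv
branch_pool = np.zeros((28, 28, 32))   # max-pool, then 1x1 conv to shrink channels

# The layer's output stacks the branches channel-wise.
out = np.concatenate([branch_1x1, branch_3x3, branch_5x5, branch_pool], axis=-1)
print(out.shape)                       # (28, 28, 256): 64 + 128 + 32 + 32 channels
```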
Inception-Network - Consists of a bunch of inception layers.